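The CUDA execution model turns your computer into an efficient heterogeneous system. Picture a commander-in-chief (the host, or CPU) and an army of a thousand troops (the device, or GPU). The commander handles complex logic and decision-making, while the troops carry out massive repetitive tasks all at once.
1. Structural Differences
The host is a latency-optimized CPU, designed for complex control flow and serial tasks. The device, in contrast, is a throughput-optimized GPU containing thousands of simple cores that can execute the same instruction across a large dataset simultaneously.
2. Execution Rhythm
A CUDA program runs as a sequence of phases. Execution begins on the host, which handles the serial code. When the program reaches a parallel kernel, it launches a grid of threads on the device. Once the device finishes its massive workload, control returns to the host.
3. Performance Specialization
This model plays to the strengths of both processors: the CPU manages system resources and complex branching, while the GPU runs SPMD (single program, multiple data) logic to process data elements in parallel.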
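To make the SPMD idea concrete, here is a minimal Python sketch that emulates a kernel launch on the CPU. It is not CUDA code: the names `vector_add_kernel` and `launch` are hypothetical, the block size of 4 is an arbitrary choice, and the loops run sequentially where a real device would run the threads in parallel. What it does illustrate is that every thread runs the same function body and differs only in the index it derives from its block and thread coordinates.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by every simulated thread (the 'single program')."""
    i = block_idx * block_dim + thread_idx  # global index unique to this thread
    if i < len(out):  # guard: the last block may contain surplus threads
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for launching a 1D grid of thread blocks."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


# Host code: serial setup, one "launch", then serial use of the result.
n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
out = [0.0] * n

block_dim = 4                                # threads per block (arbitrary)
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check inside the kernel matters because the grid is rounded up to whole blocks, so a few threads in the last block may have no element to process.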
QUESTION 1
Which architecture is characterized as being 'throughput-optimized'?
The Host (Intel® CPU)
The Device (NVIDIA® GPU)
The System RAM
The PCIe Bus
✅ Correct!
Correct! GPUs are designed to maximize the total amount of work (throughput) done per unit of time by processing thousands of data points simultaneously.
❌ Incorrect
The Host (CPU) is 'latency-optimized' to minimize the time a single thread takes to execute.
QUESTION 2
Part 1 of the MatrixMultiplication() example in Figure 3.6 must be completed with declarations of the Nd and Pd pointer variables and their corresponding cudaMalloc() calls, and Part 3 must be completed with the matching calls that release that memory. Which snippet correctly completes both parts?
float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);
float Nd, Pd; malloc(&Nd, size); ... free(Nd);
float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;
int Nd, Pd; Nd = new float[size]; ... free(Nd);
✅ Correct!
Exactly. You must declare pointers for the device, use cudaMalloc with a double-pointer cast, and use cudaFree to release the memory.
❌ Incorrect
Standard C malloc/free or C++ new/delete cannot be used to manage Device (GPU) memory.
QUESTION 3
In the CUDA execution model, where does a program always begin its execution?
On the Device (GPU)
Simultaneously on both
On the Host (CPU)
In the Global Memory
✅ Correct!
Correct. Execution starts with the serial code on the Host (CPU).
❌ Incorrect
The GPU only begins work when a Kernel is specifically launched by the Host.
QUESTION 4
What happens when the Host encounters a phase with rich data parallelism?
It speeds up its clock frequency.
It launches a Kernel onto the Device.
It stores the data in the Host Cache.
It converts the code to Python.
✅ Correct!
Yes! The Host 'offloads' the parallel work by launching a kernel on the massive core array of the GPU.
❌ Incorrect
The CPU is not optimized for massive data parallelism; it offloads such work to the Device.
QUESTION 5
A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?
The G80 cannot handle 1024 blocks.
The total number of threads exceeds 1 million.
The configuration results in 1024 threads per block, exceeding the 512 hardware limit.
Matrix multiplication is not data parallel.
✅ Correct!
Precisely. 1,048,576 elements divided by 1024 blocks results in 1024 threads per block, which exceeds the G80 architecture limit of 512.
❌ Incorrect
Check the thread-per-block limit for the G80 architecture: it is 512.
Case Study: High-Resolution Fluid Dynamics
Optimizing a Heterogeneous Simulation
You are developing a fluid dynamics engine. The simulation involves: (A) handling the user interface and file logging, (B) computing the pressure gradients for 20 million fluid cells, and (C) updating the simulation time-step based on global convergence tests. You must decide how to map these tasks to the CUDA execution model.
Q
1. Which tasks (A, B, or C) should definitely remain on the Host, and why?
Solution:
Task A (UI and Logging) and Task C (Time-step logic) should remain on the Host. These tasks are serial in nature, involve complex I/O and control logic, and do not benefit from throughput optimization. The Host is designed to minimize the latency of these single-threaded tasks.
Q
2. How does the 'alternating phases' concept apply to the interaction between tasks B and C?
Solution:
The program enters a loop where the Host launches Task B (the parallel pressure kernel) on the Device. Once Task B completes (synchronization), control returns to the Host to perform Task C (serial convergence check). This repeats for every time-step in the simulation.
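The alternating-phase loop described in this solution can be sketched in plain Python. This is a CPU-only illustration under loud assumptions: `pressure_kernel`, `converged`, and `simulate` are hypothetical stand-ins, the "physics" is a toy relaxation toward neighbor averages rather than a real pressure solve, and the parallel phase is emulated with an ordinary loop.

```python
def pressure_kernel(cells):
    """Stand-in for Task B: every interior cell is updated independently
    from the previous step's values, so the work is data parallel."""
    new = cells[:]  # boundary cells keep their values
    for i in range(1, len(cells) - 1):  # on a GPU: one thread per cell
        new[i] = 0.5 * (cells[i - 1] + cells[i + 1])
    return new


def converged(old, new, tol):
    """Stand-in for Task C: a serial, global check performed on the host."""
    return max(abs(x - y) for x, y in zip(old, new)) < tol


def simulate(cells, tol=1e-6, max_steps=10_000):
    """Alternate a parallel phase and a serial phase until convergence."""
    steps = 0
    while steps < max_steps:
        new = pressure_kernel(cells)       # parallel phase (device)
        steps += 1
        done = converged(cells, new, tol)  # serial phase (host), after the kernel finishes
        cells = new
        if done:
            break
    return cells, steps


final, steps = simulate([1.0, 0.0, 0.0, 0.0, 0.0])
print(steps, [round(v, 3) for v in final])
```

The host owns the loop and the stopping decision, while each iteration hands the per-cell work to the kernel and waits for it before running the convergence check.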